ML Questions
1. ML Fundamentals
1.1 Datasets
1.1.1. Data Preparation Process
- Data Collection with random or stratified sampling
- Data Cleaning: check for missing or duplicate data, identify outliers, remove irrelevant information, and correct errors
- Data Labeling
- Data Splitting into training, validation, and test sets; the training set for weight learning, the validation set for hyperparameter tuning, the test set for performance evaluation
- Data Preprocessing with normalization, scaling or transforming
- Balance Checking
- Data Shuffling to reduce bias and eliminate order-based information
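The shuffle-and-split steps above can be sketched with the standard library; the function name and the 80/10/10 ratios are illustrative, not prescribed by any framework:

```python
import random

def train_val_test_split(data, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle, then split into train/validation/test sets.

    Shuffling first reduces order-based bias; the validation set is
    for hyperparameter tuning, the test set for final evaluation.
    """
    data = list(data)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return (data[n_test + n_val:],          # train
            data[n_test:n_test + n_val],    # validation
            data[:n_test])                  # test

train, val, test = train_val_test_split(range(100))
```

In practice a library helper (e.g. a stratified splitter) is preferable, since it can also preserve label balance across the splits.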
1.1.2. How to Verify Collected Data Is Suitable
- Quantity
- Quality: missing values, duplicates, outliers, irrelevant information, errors
1.1.3. Handle Label Imbalance
- Resampling:
- Oversampling the minority classes by duplicating samples
- Downsampling the majority classes by removing samples
- Synthetic data generation: creating synthetic samples
- Cost-sensitive learning: assigning higher misclassification costs (loss weights) to minority classes
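The oversampling option above can be sketched in plain Python; this illustrative helper duplicates minority-class samples at random until all classes match the majority count (no synthetic samples, unlike SMOTE-style methods):

```python
import random
from collections import Counter

def oversample_minority(samples, labels, seed=0):
    """Duplicate minority-class samples until every class matches the
    majority-class count (random duplication, no synthetic samples)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extras = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extras:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

xs, ys = oversample_minority([1, 2, 3, 4, 5], ["a", "a", "a", "a", "b"])
```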
1.1.4. Handle Missing Labels
- Data Annotation: expensive and time consuming
- Remove missing labels: loss of information
- Impute missing labels with the mean or mode; may introduce bias
- Induction: train a model on the available labeled data, then use it to predict the missing labels
- Transduction, such as propagating labels via clustering
1.2 Features
1.2.1 Types of Features
- Numerical features: age, height
- Categorical features: ID, Gender
1.2.2 What is Feature Selection / Importance
- Feature selection: identify the most relevant features
- Feature importance: assign a numerical score indicating the contribution of each feature
1.2.3 How to Do Feature Selection
- Statistical tests such as chi-squared test, ANOVA
- training a model with different subsets of features
1.2.4 Handle Missing Values
- Deletion: remove the instances that contain missing values.
- Mean/Median/Mode
- Interpolation for time-series data
- kNN
- Regression
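The mean/mode imputation strategies above can be sketched as follows (illustrative helpers, with `None` standing in for a missing value):

```python
from collections import Counter

def impute_mean(column):
    """Replace missing numerical values (None) with the observed mean."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def impute_mode(column):
    """Replace missing categorical values (None) with the most common value."""
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]
```

Both shift the column's distribution toward its center, which is the bias trade-off noted above; kNN or regression imputation uses the other features to make a per-row estimate instead.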
1.3 Modeling
1.3.1 Common modeling algorithms
- Logistic regression
- the features are combined linearly and then transformed using the logistic/softmax function
- Loss functions: Log Loss, or Cross-Entropy loss.
- regularization: using L1 and L2
- Pros: Simple and efficient; easily interpreted
- Cons: models only linear relationships; may overfit if #features >> #samples
- Deep neural networks (DNN)
- consists of layers of interconnected neurons
- Loss functions: MSE, Cross-Entropy
- regularization (avoid overfitting): L1 or L2 regularization, dropout, early stopping
- Pros: learns nonlinearity between inputs and outputs; scalability; adaptability to various features and label types
- Cons: computationally expensive; data demanding; difficult to interpret
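The logistic-regression scoring step described above (linear combination, then logistic squashing) can be sketched for the binary case; the function names are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Combine features linearly, then squash with the logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)
```

For multi-class problems the softmax function replaces the sigmoid, producing one probability per class.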
1.3.2 Common loss functions
- Cross-Entropy/Log loss: classification
- MAE/MSE: regression
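A minimal sketch of both loss families, assuming binary labels for the cross-entropy case:

```python
import math

def log_loss(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy, averaged over samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```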
1.3.3 Second-Order Optimization Methods
- more efficient than first-order optimization methods, as they have a better estimate of the curvature of the objective function
- Downside: higher computational cost (Hessian matrix); Memory requirements
1.3.4 Common Optimization Algorithms
- SGD: compute the gradient from a single random sample
- Mini-batch: compute the gradient from a small batch of samples
- Momentum: compute the update from the current gradient and the previous weight update
- Root Mean Squared Propagation (RMSProp)
- Adam
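The momentum update described above can be sketched as one step; the parameter names (`lr`, `beta`) and values are illustrative defaults, not from the source:

```python
def momentum_step(weights, velocity, grads, lr=0.1, beta=0.9):
    """One momentum update: blend the current gradient with the previous
    update direction (velocity), then step the weights."""
    velocity = [beta * v - lr * g for v, g in zip(velocity, grads)]
    weights = [w + v for w, v in zip(weights, velocity)]
    return weights, velocity
```

RMSProp and Adam follow the same pattern but additionally track a running average of squared gradients to scale the step per parameter.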
1.3.5 Hyperparameter Tuning
- Methods:
- Grid search: a predefined set of hyperparameter combinations is evaluated exhaustively
- Bayesian optimization leverages a probabilistic model to approximate the relationship between hyperparameters and the model performance
- Genetic algorithm: treats combinations of hyperparameters as "genes" and uses "mutation" and "crossover" to select the fittest hyperparameters
- Hyperparameters to Tune
- logistic regression: solver, Learning rate, regularization terms
- DNNs: all above + #hidden layers, Embedding size, Batch size
1.3.6 Overfitting
- what: the trained model fits the training data too closely and fails to generalize
- detect: accuracy gap between training and validation sets; training loss keeps decreasing rapidly while validation loss decreases slowly or rises
- methods:
- more training data;
- regularization/dropout (randomly sets unit outputs in a given layer to zero);
- Reduce model capacity by reducing the number of input variables, the number of layers, and the size of each layer.
- Early stopping
- Batch normalization after ReLU
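The early-stopping method above can be sketched as a small tracker; the class and attribute names are illustrative:

```python
class EarlyStopping:
    """Stop training once validation loss has not improved for
    `patience` consecutive evaluations."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

A training loop would call `should_stop` after each validation pass and break out of the loop when it returns `True`, keeping the weights from the best epoch.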
1.3.7 Regularization (Grab interview question, 2019.08)
- prevent overfitting
- L1 (Lasso) regularization:
- adds a penalty term to the loss function proportional to the absolute value of the weights.
- used when the objective is to reduce the number of features the model uses (such as redundant features)
- L2 (Ridge) regularization:
- adds a penalty term to the loss function proportional to the square of the values of the weights.
- used to reduce the magnitude of the weights, spreading them out more evenly
- L1 make weights sparse while L2 does not
- the gradient of L1 is either -1 or 1 (except at zero), so the penalty moves each weight toward zero by a constant increment
- the gradient of L2 is a linear function that decreases to zero as the weight approaches zero
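The gradient contrast above can be made concrete; these illustrative helpers compute only the penalty term's contribution to the gradient for a single weight `w` with regularization strength `lam`:

```python
def l1_grad(w, lam):
    """L1 penalty gradient: constant-magnitude pull toward zero
    (drives small weights all the way to zero -> sparsity)."""
    return lam * (1.0 if w > 0 else -1.0 if w < 0 else 0.0)

def l2_grad(w, lam):
    """L2 penalty gradient: pull proportional to the weight
    (shrinks weights but rarely zeroes them out)."""
    return 2.0 * lam * w
```

Note how a tiny weight still feels the full L1 pull (hence sparsity), while the L2 pull fades as the weight approaches zero.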
1.3.8 Linear and Logistic Regression
- Linear regression models a linear relationship between a dependent variable and one or more independent variables by fitting a best line
- minimizes the sum of the squared errors between the predicted and actual values in the data
- Logistic regression models the probability of a binary outcome by fitting a logistic curve
- maximizes the likelihood of the observed data by minimizing the negative log-likelihood (Log Loss or Cross-Entropy)
1.3.9 Activation Functions
- Softmax
- Converts raw scores into probabilities in multi-class classification
- Not used in hidden layers.
- Sigmoid
- Maps input to the range of 0 to 1
- vanishing gradients: the function saturates for very large or very small inputs, so the gradient approaches zero
- Not zero-centered: gradient updates always have the same sign, resulting in slower convergence
- Tanh:
- Maps input to the range of -1 to 1
- vanishing gradients, similar to sigmoid
- zero-centered: won't suffer from zigzagging gradient updates
- ReLU (Rectified Linear Unit)
- f(x)=max(0,x)
- no vanishing gradient
- dying ReLU problem: neurons can become inactive if inputs are always negative
- Leaky ReLU
- f(x) = x if x > 0, else αx, with α small (e.g., 0.01)
- Fixes dying ReLU problem.
- α needs to be tuned
- When choosing an activation function, consider its
- (1) zero-centricity,
- (2) computational cost, and
- (3) gradient anomalies such as vanishing or exploding gradients
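The activation functions above, as minimal sketches (the softmax shifts by the max score for numerical stability; this shift cancels out and does not change the result):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))      # range (0, 1), not zero-centered

def tanh(z):
    return math.tanh(z)                    # range (-1, 1), zero-centered

def relu(z):
    return max(0.0, z)                     # cheap; can "die" on negative inputs

def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z       # small slope fixes dying ReLU

def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]
```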
1.3.10 Difference between boosting and bagging (Grab Glassdoor interview 2019.07)
- Boosting and bagging are ensemble methods that improve the accuracy of predictive models
- Boosting is a sequential process where models are built iteratively; more prone to overfit
- Bagging builds multiple models in parallel, each on a different subset of the training data; more robust to outliers and noise
1.3.11 Vanishing and Exploding Gradient
- from chain rule in backpropagation, gradients are multiplied together
- if they are < 1, the product shrinks; otherwise it grows
- Vanishing Gradients
- What?
- Gradients shrink exponentially as they propagate backward
- commonly occurs with sigmoid or tanh, and in RNNs
- How to fix?
- ReLU or its variants
- Batch normalization
- Gradient clipping
- Exploding Gradients
- What?
- Gradients grow exponentially
- Model weights blow up, loss becomes NaN or infinity.
- Common in very deep networks: each layer's gradient is a product of the previous layers' gradients
- How to fix?
- Batch/Layer normalization
- Gradient clipping
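Gradient clipping by norm, the fix listed for both problems, can be sketched as follows (an illustrative stdlib version; deep-learning frameworks ship an equivalent utility):

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    """Rescale the gradient vector when its L2 norm exceeds max_norm,
    preserving its direction."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        return [g * max_norm / norm for g in grads]
    return grads
```

Because only the magnitude is capped, the update still points the same way, which is why clipping tames exploding gradients without changing the descent direction.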
1.3.12 Unsupervised Learning
- k-means, Principal Component Analysis (PCA)
1.4 Evaluation
1.4.1 Determine whether a loss function is convex
- First-order condition: the gradient is monotonically non-decreasing
- Second-order condition: the second derivative (Hessian) is everywhere non-negative (positive semidefinite)
1.4.2 Evaluation Metrics
- Classifier models
- Accuracy: correct predictions among all predictions
- Precision: true positives among all positive predictions
- Recall: true positives among all actual positive cases
- F1 Score: Harmonic mean of Precision and Recall.
- ROC: Receiver Operating Characteristic curve, TPR plotted against FPR
- AUC: Area Under the ROC Curve, quantifies the ability of the model to distinguish between the two classes that it is predicting
- Regression models
- Mean Absolute Error (MAE): average absolute difference between the predicted and actual values
- Mean Squared Error (MSE): mean of the squared differences between the predicted and actual values.
- Root Mean Squared Error (RMSE): Square root of the mean of the squared differences.
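The precision/recall/F1 definitions above, as one illustrative helper for binary labels:

```python
def precision_recall_f1(y_true, y_pred):
    """Binary classification metrics from 0/1 label lists (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```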
1.4.3 Why some metrics are not used to optimize a model
- some are not differentiable, e.g. accuracy or F1
- some are non-convex
1.4.4 Debug a Model with Non-Converging Loss
- Define the problem: what exactly is non-converging (loss oscillating, plateauing, diverging)?
- Visualize the data: plot the data to check for patterns, outliers, corruption, and other issues that could be affecting model performance
- Inspect the model state at each layer over several steps
- Verify the loss computation, learning rate, gradient computation, and layer weights
2. ML System Design
```mermaid
---
config:
  theme: redux
---
flowchart TB
    Pre --> Design
    subgraph Pre
        cp["`**Clarify Problem**
        (1) Why
        (2) What metrics
        (3) What contents
        (4) How to blend contents
        (5) Operational parameters of RecSys`"] --> hd["`**High-Level Design**
        (1) High-level design
        (2) Cold-start`"]
    end
    subgraph Design
        DC["`**Data Collection & Preprocessing**`"]
        FE["`**Feature Engineering**`"]
        MEM["`**Modeling & Evaluation Metrics**`"]
        DE["`**Deployment**`"]
        DC --> FE --> MEM --> DE
    end
```
2.1 Design Framework
- Clarify the problem: Why build it? Who are the end users? What metrics? Size of dataset? Latency requirements?
- High-level design: identify inputs, intermediate stages, and outputs
- in RecSys: candidate fetch -> filters -> pre-ranking -> full ranking -> reranking
- in IESys: content classification -> entity extraction -> entity resolution
- Pipeline for an Information Extraction (IE) System:
- (1) Content Classification: Categorizes text into predefined types (e.g., news vs. blog, legal vs. medical)
- Example: Classifying an email as a "job offer"
- (2) Entity Extraction: Identifies key pieces of information (named entities) in the text, e.g. People, locations, organizations, dates, amounts.
- Example: From the sentence “Alice works at Google,” extract: Person: Alice, Organization: Google
- (3) Entity Resolution: Links or de-duplicates entities across documents or mentions.
- Example: Ensure that "Dr. Alice Smith", "A. Smith", and "Alice" all refer to the same person.
- Data Collection & Processing
- Feature engineering
- Modeling & evaluation
- Deployment & serving
2.2 Clarify the Problem (RecSys)
2.2.1 Why recommend content to users
- Relevance: Personalized Experience
- Discoverability: save users the time and effort of searching
- Increased engagement
- Increased revenue
- Better understanding of customer behavior to improve products and services
2.2.2 What are the metrics
- Customer:
- User retention: daily and monthly active users, etc
- click-through rate
- Revenue:
2.2.3 What kinds of contents
- Customer:
- Non-personalized content: based on geography; age; gender; trending
- Personalized content: consider personal traits and previous interactions
- Other Factors: diversity & serendipity
- Revenue/Platform: show ads to maximize revenue
- Engagement predictions (how the user will interact with the ad)
- Economic factors (bids, budgets)
- platform strategies.
2.2.4 How to blend contents (e.g., in-network, out-of-network, ads)?
2.2.5 what are operational parameters of RecSys
- Latency,
- Throughput,
- Number of candidates,
- Number of results
2.3 High-Level Design
2.3.1 the high-level design for RecSys
- Candidate generation: produce hundreds or thousands of candidate items
- Filters: eliminate 20% to 90% of the remaining candidates
- Pre-ranking: narrow down to a few hundred candidates
- Full-ranking: assign scores to contents
- Reranking: rerank candidates based on a variety of factors, such as diversity, freshness
2.3.2 cold start problem
- New User
- Data collection: gather user's information such as preferences during sign-up
- Collaborative filtering or non-personalized recommendations
- based on early feedback, use systematic exploration approaches to engage the user, such as the upper confidence bound (UCB) algorithm
- Modeling for cold-start
- train a fallback model to recommend when there is insufficient data for the primary model
- train the primary model with (feature) dropout so it can still recommend when user data is missing
2.4 Data Collection
2.4.1 What datasets and how to collect
- user-item interactions, such as clicks, purchases, follows
- contextual features, such as demographic data, device type, time of day
- negative sampling data, such as unclicked items, because the user-item interaction matrix can be sparse and explicit negative feedback is rare
- Uniform sampling: Randomly selects items that are not in the user's history
- Popularity-based sampling: Samples negative items based on their popularity
- Faraway negative sampling: Selects negative items that are dissimilar to positive items, or items with lower scores.
- for pre-training models, can collect the score from offline heavy ranker as a soft label
2.4.2 Possible biases and solution in the dataset
- Position bias: items at the top of the list are more likely to be clicked
- randomly shuffling
- add positional features
- Presentation bias: presentation of the items influences the user's decision
- Trust bias: users have a high level of trust in the RecSys
- Serving/Algorithmic bias: this is the tendency of a system to recommend items similar to those that have historically performed well
- ϵ-Greedy: recommend the highest-scored items with probability 1−ϵ, and random items with probability ϵ
- UCB
2.5 Candidate Generation
2.5.1 Sources of Candidate Generation
- Non-personalized sources: Popular content, Trending, New Content, etc
- good for cold-start problem; less relevant
- Personalized in-network sources: in-network content (e.g., followed businesses), historical content (highly rated)
- Pros: highly relevant; Cons: less diversity
- Personalized out-of-network sources: collaborative filtering, content-based filtering
- Pros: highly personalized; Cons: limited diversity; suffers from cold-start
2.5.2 Steps in Candidate Generation
- Fetch sources: which are seeds for generating candidates, such as historical interacted items
- Generate candidates: with candidate generation algorithms
- Filter Candidates: by merging and pruning
2.5.3 Algorithms for Candidate Generation
- Neighborhood-based methods: Pearson Correlation
- Graph-based methods
- Latent methods: matrix factorization and DNNs to capture latent relationships
2.5.4 How to merge and prune candidates to limit #Candidates
2.5.5 Why not rank items in candidate generation
- Multiple candidate generators might be used, but their scores are not comparable
- Scoring function in candidate generation is not tailored for ranking task, while ranking task requires more complex models and features
2.5.6 How to handle a large-scale system where not every candidate can be scored in real time
- approximate nearest neighbors (ANN)
- pre-computing
2.6 Pre-Ranking
2.6.1 What is pre-ranking (light-ranking)
- a simplified version of the heavy (full) ranking model
- use less features
- use output of full model as targets
- quickly rank candidates in advance
- used when latency of RecSys is critical
2.6.2 Evaluation Metrics for Pre-Ranking Model
- Top-k recall: the ability to identify top-k candidates
- Expected Top-k recall: smoother version of top-k
2.6.3 Algorithms for Pre-Ranking Model
- Two-Tower architecture: one tower for target (user) features, the other for candidate features
- lightweight DNNs
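In a two-tower setup, each tower is a network that maps its features to an embedding; at serving time only the interaction between the two embeddings must be computed. A minimal sketch of that scoring step, assuming the tower outputs are already available as plain lists (the towers themselves are omitted):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def two_tower_scores(user_embedding, item_embeddings):
    """Score each candidate as the dot product between the user-tower
    output and each (precomputable) item-tower output."""
    return [dot(user_embedding, item) for item in item_embeddings]
```

Because item embeddings can be precomputed and indexed, this dot-product form is what makes approximate-nearest-neighbor retrieval fast at scale.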
2.6.4 How to handle Pre-rank in large scale
2.7 Feature Engineering
2.7.1 What features to use
- Target Features: describing the user
- Candidate features: describing items
- Crossing features: describing the interaction between users and items; can be learned by deep crossing layers
2.7.2 How to handle textual or id-based features
- one-hot encoding for low cardinality
- hashing trick for high cardinality by applying a hash function to the features
- embedding: represent categorical features as continuous-valued vectors to capture relationships between input values
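One-hot encoding and the hashing trick above can be sketched as follows; the bucket count and use of MD5 (chosen here only for determinism) are illustrative:

```python
import hashlib

def one_hot(value, vocabulary):
    """One-hot encoding for low-cardinality categorical features."""
    return [1 if value == v else 0 for v in vocabulary]

def hashing_trick(value, num_buckets=1000):
    """Map a high-cardinality value (e.g. an ID) into a fixed number of
    buckets; collisions are the accepted trade-off."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets
```

Embeddings go one step further: the hashed or indexed ID selects a row of a learned matrix, so similar IDs can end up with similar vectors.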
2.7.3 How to handle counting features
- Log transformation by applying logarithm function
- Bucketing
- Clipping
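The three counting-feature transforms above, as minimal sketches (the cap and bucket boundaries are illustrative values):

```python
import math

def log_transform(count):
    return math.log1p(count)        # log(1 + x) keeps zero counts valid

def clip(count, cap=100):
    return min(count, cap)          # limit the influence of extreme counts

def bucketize(count, boundaries=(1, 10, 100)):
    """Return the index of the first boundary the count falls below."""
    for i, b in enumerate(boundaries):
        if count < b:
            return i
    return len(boundaries)
```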
2.8 Modeling & Evaluation
2.8.1 What does the heavy ranking model learn?
- Learn to rank
- Pointwise: assign a single score for each candidate
- Pairwise: distinguish between preferred and non-preferred candidates; computationally expensive
- Listwise: consider entire list instead of just pairwise comparisons or single-item predictions; require a significant amount of training data and computational resources
2.8.2 Algorithms for full-ranking model?
- For point-wise models
- DNNs
- Multi-task learning
- Transformers, Foundation Models
2.8.3 Evaluation Metrics
- pointwise metrics
- AUC: Area Under the ROC Curve; most popular; quantifies the ability of the model to distinguish between the classes that it is predicting
- Mean Absolute/Squared Error
- ranking metrics
- Mean Average Precision (MAP)
- infrastructure metrics
- latency
- computational cost
- product metrics collected via A/B testing
- positive engagement
- retention
- revenue
- conversion rate